In [1]:
%pylab inline
%qtconsole


Populating the interactive namespace from numpy and matplotlib

In [10]:
import sklearn
from sklearn import cluster,datasets

Dataset

Let's generate a simple Gaussian mixture to benchmark K-Means.


In [7]:
N=10**4
D_gen=2
K_gen=10

In [27]:
X,A_true=sklearn.datasets.make_blobs(n_samples=N, n_features=D_gen, centers=K_gen, cluster_std=1.0, center_box=(-100.0, 100.0))

K-Means


In [12]:
K=10
N_inits=500
max_iter=20

In [17]:
%%timeit
centroids,A_clust,c,=sklearn.cluster.k_means(X, K, init='random',
                                             precompute_distances=True, n_init=N_inits,
                                             max_iter=max_iter, random_state=None, copy_x=True,
                                             n_jobs=1)


1 loops, best of 3: 5.07 s per loop

The documentation claims that precompute_distances (True by default) will consume more memory but make the algorithm faster. For our goal (running K-Means many times with a small number of iterations) this is not the case, as the result below shows. The documentation offers no insight into what this parameter actually does under the hood.


In [18]:
%%timeit
centroids,A_clust,c,=sklearn.cluster.k_means(X, K, init='random', 
                                            precompute_distances=False, n_init=N_inits, 
                                            max_iter=max_iter, random_state=None, copy_x=True, 
                                            n_jobs=1)


1 loops, best of 3: 3.66 s per loop

The K-Means function from scikit-learn already supports CPU parallelization through a single parameter (n_jobs). As the results below show, using multiple cores gave a speedup in the precompute_distances=False case, albeit a small one; the precompute_distances=True run actually got slower. The parallelization is done on the pairwise distance computation, breaking the pairwise matrix down by the number of jobs selected.


In [21]:
%%timeit
centroids,A_clust,c,=sklearn.cluster.k_means(X, K, init='random',
                                             precompute_distances=True, n_init=N_inits,
                                             max_iter=max_iter, random_state=None, copy_x=True,
                                             n_jobs=-1)


1 loops, best of 3: 6.63 s per loop

In [24]:
%%timeit
centroids,A_clust,c,=sklearn.cluster.k_means(X, K, init='random',
                                             precompute_distances=False, n_init=N_inits,
                                             max_iter=max_iter, random_state=None, copy_x=True,
                                             n_jobs=-1)


1 loops, best of 3: 2.29 s per loop

It should be noted that this implementation appears to be accelerated by compiled C code.

These are the times from the code above before installing the MKL-optimized scikit-learn (from the Anaconda Accelerate add-on):

1 loops, best of 3: 5.67 s per loop
1 loops, best of 3: 4.05 s per loop
1 loops, best of 3: 3.29 s per loop
1 loops, best of 3: 2.57 s per loop
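
To quantify the difference, here is a small sketch that pairs each pre-MKL time with the corresponding run above, assuming the list follows the same order as the four cells (the original does not label the pairing):

In [ ]:
# Hypothetical pairing of pre-MKL times with the four timings above, in cell order.
for before, after in [(5.67, 5.07), (4.05, 3.66), (3.29, 6.63), (2.57, 2.29)]:
    print '%.2f s -> %.2f s (%.2fx)' % (before, after, before / after)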

CUDA

Following a video tutorial on YouTube. The notebook is available here.


In [2]:
import numbapro
from numbapro import jit, cuda, int32, float32, float64

#@jit(complex64(int32, float32, complex64), target="cpu")

In [3]:
numbapro.check_cuda()


------------------------------libraries detection-------------------------------
Finding cublas
	located at /home/chiroptera/anaconda/lib/libcublas.so.6.0.37
	trying to open library...	ok
Finding cusparse
	located at /home/chiroptera/anaconda/lib/libcusparse.so.6.0.37
	trying to open library...	ok
Finding cufft
	located at /home/chiroptera/anaconda/lib/libcufft.so.6.0.37
	trying to open library...	ok
Finding curand
	located at /home/chiroptera/anaconda/lib/libcurand.so.6.0.37
	trying to open library...	ok
Finding nvvm
	located at /home/chiroptera/anaconda/lib/libnvvm.so.2.0.0
	trying to open library...	ok
	finding libdevice for compute_20...	ok
	finding libdevice for compute_30...	ok
	finding libdevice for compute_35...	ok
-------------------------------hardware detection-------------------------------
Found 1 CUDA devices
id 0         GeForce GT 520M                              [SUPPORTED]
                      compute capability: 2.1
                           pci device id: 0
                              pci bus id: 1
Summary:
	1/1 devices are supported
PASSED
Out[3]:
True

In [5]:
import math  # math.sin/math.cos are needed inside the vectorized kernels

#CPU version
@numbapro.vectorize(['float32(float32,float32)',
                     'float64(float64,float64)'],target='cpu')
def cpu_sincos(x,y):
    return math.sin(x) * math.cos(y)

#GPU version
@numbapro.vectorize(['float32(float32,float32)',
                     'float64(float64,float64)'],target='gpu')
def gpu_sincos(x,y):
    return math.sin(x) * math.cos(y)

In [6]:
n=10**6
x = np.linspace(0,np.pi,n)
y = np.linspace(0,np.pi,n)

print 'Data of type ',x.dtype

print 'Unoptimized\t',
%timeit np.sin(x) * np.cos(y)
print 'CPU vectorize\t',
%timeit cpu_sincos(x,y)
print 'GPU vectorize\t',
%timeit gpu_sincos(x,y)

del x,y


Data of type  float64
Unoptimized	10 loops, best of 3: 77.5 ms per loop
CPU vectorize	10 loops, best of 3: 73.2 ms per loop
GPU vectorize	100 loops, best of 3: 17.8 ms per loop

Unoptimized and CPU vectorize times are similar because the sin() and cos() calls dominate the runtime. The GPU is already considerably faster and would do even better on a stronger GPU (the current one has only 48 CUDA cores).

All the data has to go from RAM to the GPU memory and back again. The work is distributed across all threads on the GPU, and scheduling and memory management are taken care of automatically.
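
Transfer overhead can often be reduced by staging arrays on the device explicitly. A minimal sketch, assuming the gpu_sincos ufunc above accepts device arrays and an out= argument (the same pattern the gufunc call uses later in this notebook):

In [ ]:
# Sketch: keep operands on the GPU so the ufunc call skips implicit copies.
x = np.linspace(0, np.pi, 10**6)
y = np.linspace(0, np.pi, 10**6)
dx = numbapro.cuda.to_device(x)            # explicit host->device copies
dy = numbapro.cuda.to_device(y)
dout = numbapro.cuda.device_array_like(x)  # result buffer allocated on the device
gpu_sincos(dx, dy, out=dout)               # compute without implicit transfers
result = dout.copy_to_host()               # single explicit copy back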

Another example


In [39]:
@numbapro.vectorize(['float32(float32,float32,float32,float32)'])
def cpu_powers(p,q,r,s):
    return math.sqrt(p**2 + q**3 + r**4 + s**5)

@numbapro.vectorize(['float32(float32,float32,float32,float32)'],target='gpu')
def gpu_powers(p,q,r,s):
    return math.sqrt(p**2 + q**3 + r**4 + s**5)

When target is not specified, it defaults to 'cpu'.


In [40]:
## Benchmarking
#Generate data
n=5e6
p = np.random.random(n).astype(np.float32)
q = np.random.random(n).astype(np.float32)
r = np.random.random(n).astype(np.float32)
s = np.random.random(n).astype(np.float32)

print 'Unoptimized\t',
%timeit np.sqrt(p**2 + q**3 + r**4 + s**5)
print 'CPU vectorize\t',
%timeit cpu_powers(p,q,r,s)
print 'GPU vectorize\t',
%timeit gpu_powers(p,q,r,s)

del p,q,r,s


Unoptimized	1 loops, best of 3: 2.99 s per loop
CPU vectorize	1 loops, best of 3: 2.25 s per loop
GPU vectorize	10 loops, best of 3: 80 ms per loop

Anaconda Accelerate includes MKL optimizations for several libraries. I still have to check whether these numpy routines are actually faster than the standard ones.
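
One quick way to check which BLAS/LAPACK build numpy is linked against (np.__config__.show() is standard numpy; the output depends on the install):

In [ ]:
# If the MKL-optimized build is active, the MKL libraries show up here.
np.__config__.show()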

The speedup of GPU over CPU (taking the times of a previous run, which differ slightly from the output above) is $$2.1\,\text{s} \,/\, 69.4\,\text{ms} \approx 30.26$$

We've now seen that @vectorize accelerates functions, especially on the GPU, but it only works on scalar functions. Even though more complex operations like matrix multiplication can be decomposed into several scalar operations over their elements, it would be more convenient to operate on arrays directly.

GUVectorize and General Universal Functions (ufunc)

Now the function can't return anything; instead, we pass the array where we want the result stored as a parameter. Furthermore, the types in the signature are now arrays, and there is an extra descriptive string for the shape signature.


In [7]:
@numbapro.guvectorize(['void(float32[:,:], float32[:,:], float32[:,:])'],'(m,n),(n,p)->(m,p)',target='gpu')
def batch_matrix_mult(a,b,c):
    for i in range(c.shape[0]):
        for j in range(c.shape[1]):
            tmp=0
            for n in range(a.shape[1]):
                tmp += a[i,n] * b[n,j]
            c[i,j]=tmp
#batch_matrix_mult.max_blocksize = 32  # limits to 32 threads per block

This is the code from the video. The kernel itself is a plain matrix multiplication; the batching the video mentions comes from the gufunc machinery, which broadcasts the '(m,n),(n,p)->(m,p)' kernel over the extra leading dimension of 3D inputs.
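
As a quick sanity check of that broadcasting, here is a sketch (it assumes batch_matrix_mult from above; np.einsum is standard numpy):

In [ ]:
# A stack of 2x2 matrices should be multiplied pairwise along the leading axis.
sa = np.random.random((3, 2, 2)).astype(np.float32)
sb = np.random.random((3, 2, 2)).astype(np.float32)
sc = batch_matrix_mult(sa, sb)
print np.allclose(sc, np.einsum('nij,njk->nik', sa, sb), atol=1e-5)  # expect True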


In [9]:
n=4e6
dim=2

print 'Memory for both matrices:\t',((2*n*dim*dim*4)/1024),' KBytes'

a=np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim)
b=np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim)


print 'Exception when too many GPU resources are used:'
try:
    batch_matrix_mult(a,b)
except Exception as exception:
    print exception

batch_matrix_mult.max_blocksize = 64  # limits to 64 threads per block

import numpy.core.umath_tests as ut

print 'Unoptimized\t',
#c = np.dot(a,b)
%timeit c = ut.matrix_multiply(a,b)

print 'GUVectorize GPU\t',
%timeit c2 = batch_matrix_mult(a,b)

del n,dim,a,b


Memory for both matrices:	125000.0  KBytes
Exception when too many GPU resources are used:
Unoptimized	1 loops, best of 3: 169 ms per loop
GUVectorize GPU	1 loops, best of 3: 255 ms per loop

From the documentation:

There are times when the gufunc kernel uses too many of a GPU’s resources, which can cause the kernel launch to fail. The user can explicitly control the maximum size of the thread block by setting the max_blocksize attribute on the compiled gufunc object.

To control the size of the thread block we use (e.g. for 32 threads per block):

very_complex_kernel.max_blocksize = 32

There was no speedup here: the GPU gufunc took 255 ms against 169 ms for the CPU baseline. The automatic host-device transfers on every call are the likely culprit, as the next section shows.

Manual memory management

The performance increases when the memory is manually managed. We'll run the same test as before, but now we'll copy the arrays to the GPU's memory and also allocate memory for the result.


In [11]:
n=4e6
dim=2

print 'Memory for both matrices:\t',((2*n*dim*dim*4)/1024),' KBytes'

a=np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim)
b=np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim)

dc = numbapro.cuda.device_array_like(a)
da = numbapro.cuda.to_device(a)
db = numbapro.cuda.to_device(b)

def check_pure_compute_time(da,db,dc):
    batch_matrix_mult(da,db,out=dc)
    numbapro.cuda.synchronize()
    
import numpy.core.umath_tests as ut

print 'Unoptimized\t',
#c = np.dot(a,b)
%timeit c = ut.matrix_multiply(a,b)

print 'Memory management'
batch_matrix_mult.max_blocksize = 64  # limits to 64 threads per block

print 'Automatic:\t\t',
%timeit c=batch_matrix_mult(a,b)
print 'Manual:\t\t\t',
%timeit check_pure_compute_time(da,db,dc)

del n,dim,a,b,da,db,dc


Memory for both matrices:	125000.0  KBytes
Unoptimized	1 loops, best of 3: 161 ms per loop
Memory management
Automatic:		1 loops, best of 3: 239 ms per loop
Manual:			10 loops, best of 3: 115 ms per loop

CUDA Execution Model

Kernel function

  • are only visible to the host CPU
  • cannot return any value (use an output argument)
  • are associated with a grid

A grid is a group of blocks (which can be 1D, 2D or 3D) and each block is a group of threads (also 1D, 2D or 3D). Every thread executes the same kernel function.
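
As a sketch of what this layout means for indexing, each thread derives a unique global position from its thread and block coordinates (fill_index is a hypothetical kernel; the imports match those used later in this notebook):

In [ ]:
from numbapro import cuda, void, float32

@cuda.jit(void(float32[:]))
def fill_index(out):
    # global index = offset inside the block + the block's offset inside the grid
    i = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    if i < out.shape[0]:  # guard: the grid may contain more threads than elements
        out[i] = i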

When we use @cuda.jit, we are writing functions directly against the CUDA execution model. Note that the GPU functions we wrote earlier did return values, which can't happen here; that's because @vectorize abstracts the CUDA execution model away.

Matrix multiplication test

DEBUG


In [12]:
@numbapro.guvectorize(['void(float32[:,:], float32[:,:], float32[:,:])'],'(m,n),(n,p)->(m,p)',target='gpu')
def matrix_mult(a,b,c):
    for i in range(c.shape[0]):
        for j in range(c.shape[1]):
            tmp=0
            for n in range(a.shape[1]):
                tmp += a[i,n] * b[n,j]
            c[i,j]=tmp
#matrix_mult.max_blocksize = 32  # limits to 32 threads per block

In [13]:
n=4e6
dim=2

print 'Memory for both matrices:\t',((2*n*dim*dim*4)/1024),' KBytes'

a=np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim)
b=np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim)

dc = numbapro.cuda.device_array_like(a)
da = numbapro.cuda.to_device(a)
db = numbapro.cuda.to_device(b)

def check_pure_compute_time(da,db,dc):
    matrix_mult(da,db,out=dc)
    numbapro.cuda.synchronize()
    
del a,b,da,db,dc


Memory for both matrices:	125000.0  KBytes

In [47]:
tpb=10e6 / (10*10)
print tpb
print np.sqrt(tpb)


100000.0
316.227766017

In [46]:
from numbapro import *
from timeit import default_timer as timer

m=1000
n=1000
A = np.array(np.random.random((n,m)), dtype=np.float32)
C = np.empty([n,n])
B = np.array(np.random.random((m,n)), dtype=np.float32)

%timeit X=np.dot(A,B)

@cuda.jit(void(float32[:,:],float32[:,:],float32[:,:]))
def cu_square_matrix_mul(A, B, C):

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    
    bw = cuda.blockDim.x
    bh = cuda.blockDim.y
    
    x = tx + bx * bw
    y = ty + by * bh
    
    n = C.shape[0]

    if x >= n or y >= n:
        return

    cs = 0
    for i in range(n):
        cs += A[y,i]*B[i,x]
    C[y,x]= cs

    cuda.syncthreads()
    
def check_pure_compute_time(dA,dB,dC):
    stream = cuda.stream()
    blockdim = 10,10
    griddim = 10,10
    cu_square_matrix_mul[griddim,blockdim,stream](dA, dB,dC)
    dC.to_host()
    stream.synchronize()



stream = cuda.stream()
dA = cuda.to_device(A, stream)
dC = cuda.to_device(C, stream)
B = np.array(np.random.random((m,n)), dtype=np.float32)
dB = cuda.to_device(B, stream)

check_pure_compute_time(dA,dB,dC)       
"""
print
print("Numpy took    %f seconds" % numpy_time)
print("CUDA JIT took %f seconds, %.5fx speedup" % (cuda_time, numpy_time / cuda_time))"""


10 loops, best of 3: 46 ms per loop
---------------------------------------------------------------------------
CudaAPIError                              Traceback (most recent call last)
<ipython-input-46-99de67136a24> in <module>()
     49 dB = cuda.to_device(B, stream)
     50 
---> 51 check_pure_compute_time(dA,dB,dC)
     52 """
     53 print

<ipython-input-46-99de67136a24> in check_pure_compute_time(dA, dB, dC)
     37     blockdim = 32,32
     38     griddim = 10,10
---> 39     cu_square_matrix_mul[griddim,blockdim,stream](dA, dB,dC)
     40     dC.to_host()
     41     stream.synchronize()

/home/chiroptera/anaconda/lib/python2.7/site-packages/numba/cuda/compiler.pyc in __call__(self, *args, **kwargs)
    305                           blockdim=self.blockdim,
    306                           stream=self.stream,
--> 307                           sharedmem=self.sharedmem)
    308 
    309     def bind(self):

/home/chiroptera/anaconda/lib/python2.7/site-packages/numba/cuda/compiler.pyc in _kernel_call(self, args, griddim, blockdim, stream, sharedmem)
    345                                    sharedmem=sharedmem)
    346         # invoke kernel
--> 347         cu_func(*args)
    348 
    349         if self.debug:

/home/chiroptera/anaconda/lib/python2.7/site-packages/numba/cuda/cudadrv/driver.pyc in __call__(self, *args)
   1043 
   1044         launch_kernel(self.handle, self.griddim, self.blockdim,
-> 1045                       self.sharedmem, streamhandle, args)
   1046 
   1047     @property

/home/chiroptera/anaconda/lib/python2.7/site-packages/numba/cuda/cudadrv/driver.pyc in launch_kernel(cufunc_handle, griddim, blockdim, sharedmem, hstream, args)
   1087                           hstream,
   1088                           params,
-> 1089                           None)
   1090 
   1091 

/home/chiroptera/anaconda/lib/python2.7/site-packages/numba/cuda/cudadrv/driver.pyc in safe_cuda_api_call(*args)
    213         def safe_cuda_api_call(*args):
    214             retcode = libfn(*args)
--> 215             self._check_error(fname, retcode)
    216 
    217         setattr(self, fname, safe_cuda_api_call)

/home/chiroptera/anaconda/lib/python2.7/site-packages/numba/cuda/cudadrv/driver.pyc in _check_error(self, fname, retcode)
    243             errname = ERROR_MAP.get(retcode, "UNKNOWN_CUDA_ERROR")
    244             msg = "Call to %s results in %s" % (fname, errname)
--> 245             raise CudaAPIError(retcode, msg)
    246 
    247     def get_device(self, devnum=0):

CudaAPIError: Call to cuLaunchKernel results in CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES

In [29]:
import numba

my_gpu = numba.cuda.get_current_device()
print "Running on GPU:", my_gpu.name
cores_per_capability = {1: 8, 2: 32, 3: 192}  # CUDA cores per SM by major compute capability
cc = my_gpu.compute_capability
print "Compute capability: ", "%d.%d" % cc, "(Numba requires >= 2.0)"
majorcc = cc[0]
print "Number of streaming multiprocessors:", my_gpu.MULTIPROCESSOR_COUNT
cores_per_multiprocessor = cores_per_capability[majorcc]
print "Number of cores per multiprocessor:", cores_per_multiprocessor
total_cores = cores_per_multiprocessor * my_gpu.MULTIPROCESSOR_COUNT
print "Number of cores on GPU:", total_cores


Running on GPU: GeForce GT 520M
Compute capability:  2.1 (Numba requires >= 2.0)
Number of streaming multiprocessors: 1
Number of cores per multiprocessor: 32
Number of cores on GPU: 32

In [19]:
%qtconsole

Distance calculation


In [2]:
import numbapro
from numbapro import *

In [88]:
def dist(a,b,c):
    N,K = c.shape
    dim = a.shape[1]

    for n in range(N):
        for k in range(K):
            acc = 0.0  # accumulate the squared Euclidean distance
            for d in range(dim):
                diff = a[n,d]-b[k,d]
                acc += diff ** 2
            c[n,k] = acc

In [ ]:
def dist2(a,b,c):
    N,K = c.shape
    dim = a.shape[1]
   
    for k in range(K):
        dist = a - b[k]
        dist = dist ** 2
        c[:,k] = dist.sum(axis=1)
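
Before benchmarking, a quick consistency check between the two CPU versions, as a sketch (assumes dist and dist2 defined above):

In [ ]:
# Both versions should produce the same squared-Euclidean distance matrix.
ta = np.random.random((100, 2)).astype(np.float32)
tb = np.random.random((5, 2)).astype(np.float32)
c1 = np.empty((100, 5), dtype=np.float32)
c2 = np.empty((100, 5), dtype=np.float32)
dist(ta, tb, c1)
dist2(ta, tb, c2)
print np.allclose(c1, c2)  # expect True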

In [23]:
@numbapro.guvectorize(['void(float32[:,:], float32[:,:], float32[:,:])'],
                      '(m,n),(p,n)->(m,p)',target='gpu')
def guvect_dist(a,b,c):
    N,K = c.shape
    dim = a.shape[1]
   
    for n in range(N):
        for k in range(K):
            dist=0
            for d in range(dim):
                diff = a[n,d]-b[k,d]
                #dist += np.power(diff,2)
                dist += diff ** 2
            c[n,k]=dist

In [179]:
@numbapro.cuda.jit("void(float32[:,:], float32[:,:], float64[:,:])")
def cu_dist(a,b,c):
    """
    # thread ID inside block
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    
    # block ID
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    
    # block dimensions
    bw = cuda.blockDim.x
    bh = cuda.blockDim.y
    
    # compute thread's x and y index (i.e. datapoint and cluster)
    n = tx + bx * bw # row for each datapoint
    k = ty + by * bh # column for each cluster
    """
    
    n,k = numbapro.cuda.grid(2)
    
    ch, cw = c.shape # c width and height

    if n >= ch or k >= cw:
        return

    dist = 0.0
    for d in range(a.shape[1]):
        diff = a[n,d]-b[k,d]
        dist += diff ** 2
    c[n,k]= dist

    #cuda.syncthreads()
    
def cu_dist_wrap(a,b,c,blockDim,gridDim):
    dA = cuda.to_device(a)
    dB = cuda.to_device(b)
    #dC = cuda.to_device(c)
    #dC = numbapro.cuda.device_array(c)
    dC = numbapro.cuda.device_array_like(c)
    #stream = cuda.stream()
    cu_dist[gridDim,blockDim](dA,dB,dC)
    #dC.to_host()
    #stream.synchronize() #block untill stream is done
    dC.copy_to_host(ary=c)

In [171]:
#%debug
##generate data
n = 1e4
d = 2
k = 20

total_bytes = (n * d + k * d + n * k) * 4
print 'Memory used by arrays:\t',total_bytes/1024,'\tKBytes'
print '\t\t\t',total_bytes/(1024*1024),'\tMBytes'

data = np.random.random((n,d)).astype(np.float32)
clusters = np.random.random((k,d)).astype(np.float32)
#dists = np.empty((n,k)).astype(np.float32)
distsDim = np.int(n),np.int(k)

blockDim = 8,6 # GT520M has 48 CUDA cores

numBlocks = np.ceil((n * k) / 48)
gridSquareSide = np.int(np.ceil(np.sqrt(numBlocks)))
gridDim = gridSquareSide,gridSquareSide

print 'Excess threads:\t',
print np.prod(gridDim)*np.prod(blockDim) - (n * k)

print '\n','Shape of dists:',distsDim


Memory used by arrays:	859.53125 	KBytes
			0.839385986328 	MBytes
Excess threads:	2800.0

Shape of dists: (10000, 20)

In [172]:
distsCUDA = np.empty(distsDim)
distsCPU = np.empty(distsDim)

dist2(data,clusters,distsCPU)
cu_dist_wrap(data,clusters,distsCUDA,blockDim,gridDim)

In [173]:
#%timeit -r 1 dist(data,clusters,dists)
#%timeit -r 1 guvect_dist(data,clusters,dists)
%timeit cu_dist_wrap(data,clusters,distsCUDA,blockDim,gridDim)
#print type(dists)
#del data,clusters,dists


100 loops, best of 3: 3.01 ms per loop

In [85]:
speedup=2.92 / 2.58e-3
print speedup


1131.78294574